CAGEF_services_slide.png

Introduction to Python for Data Science

Lecture 05: Flow Control

Student Name:

Student ID:


0.1.0 About Introduction to Python

Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style; it is 100% hands on! A few hours prior to each lecture, the materials will be available for download at QUERCUS and also distributed via email. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill in with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into Python when required. This learning approach will let you spend your time coding rather than taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark).

0.1.1 Where is this course headed?

We'll take a blank-slate approach to Python here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from some potential starting scenarios:

and get you to a point where you can:


0.2.0 Lecture objectives

Welcome to this fifth lecture in a series of seven. Today we're going to branch off into the wonderful world of flow control and how you can really make your code work for you.

At the end of this lecture we will aim to have covered the following topics:

  1. What is flow control?
  2. Logical, conditional, and comparison operators
  3. For Loops
  4. Conditional Loops

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command, or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn Python
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.

0.4.0 Data used in this lesson

Today's datasets will focus on using Python lists and the NumPy package.

0.4.1 Dataset 1: subset_taxa_metadata_merged.csv

This is our merged dataset from last week. We'll revisit this to play around with looping through the two-dimensional DataFrame.


0.5.0 Packages used in this lesson

IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple code outputs per code cell.
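In a notebook, that setting is typically applied like this (a minimal configuration sketch; ast_node_interactivity is the relevant IPython option):

```python
# Configure IPython to display every expression result in a cell,
# not just the last one.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
```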

numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.

pandas provides the DataFrame class that allows us to format and play with data in a tabular format.

time provides various time-related functions.


1.0.0 What is flow control and why is it important?

Flow control statements allow us to repeat a task over and over until there are no more iterations to perform, or until a condition that we set is no longer met. This means that, with a few lines of code, you can perform tasks that would otherwise require copying and pasting your code hundreds or thousands of times.

Flow control is one of the most important skills to have in your computer programming toolbox. All programming languages use it, and while the logic behind it is very similar across languages, the syntax differs. Under the hood of Python, its packages, methods, and functions all have some form of flow control implemented - especially in cases where it seems like a single command is accomplishing a lot.

Having a good understanding of data subsetting and of logical, conditional, and comparison operators is critical to writing flow control programs. Thus, we will start off this lecture with a recap of some of those concepts from previous lectures.

control.flow.jpg
Comparing the use of flow control for a program or scripts vs a linear sequence of code.
Image from https://codewithlogic.wordpress.com/2013/09/01/python-basics-understanding-the-flow-control-statements/

1.1.0 The benefits of flow control

As we saw last week, for instance, we were able to take advantage of the groupby() method to organize our DataFrame data based on categories. We may wish, for instance, to look at these groups individually to decide if they merit further analysis or visualization. If your number of groups is rather small, you could manually curate this data. When dealing with much larger data sets, it would be in your best interests to automate this using the ideas of flow control.

In the above example, we would describe the process as iterating through your DataFrame. At each iteration you are using a branching statement to determine if a primary or secondary analysis should be performed. We'll learn throughout this lecture that there are a number of syntax patterns used to iterate or loop through your data, as well as a number of predefined forms of conditional or branching statements. Here is a helpful summary of what we'll cover today:

for loop - Used to iterate through a range, list, or other iterable data structure from start to end.
    Syntax: for item in iterable:
                statement

while loop - Used to iterate through a range, list, or other iterable data structure as long as a conditional expression remains true at the start of each iteration.
    Syntax: while condition:
                statement

if - Begins a branching statement that runs specific code if a condition is met.
    Syntax: if condition:
                statement

elif - Extends the if statement with an alternative condition/action pairing.
    Syntax: elif condition:
                statement

else - A catch-all action to perform if no conditions evaluate to True.
    Syntax: else:
                statement

break - Completely exits a looping statement. Usually used within a conditional.
    Syntax: if condition:
                break

continue - Ends the current iteration of a loop but continues with the next. Usually used within a conditional.
    Syntax: if condition:
                continue

2.0.0 Logical, conditional, and comparison operators recap

One key aspect of these types of operators is that their output is boolean (True or False), and those outputs can be used to perform a wide range of operations. Here are some comparison operators.

2.1.0 Logical operators

We've already seen the logical operators in previous lectures. They're used to generate logical expressions that we use to filter values or set conditions for further steps. We'll even use these to determine the branching of code (ie Control of flow). Here's a table briefly summarizing these operators:

Operator Description
> Greater than
>= Greater than or equal to
< Less than
<= Less than or equal to
== Equivalent values (but not necessarily equivalent objects in memory)
!= Inequality or dissimilar values

These are quite straightforward to work with for integers or floats.
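For example (a minimal sketch):

```python
# Comparison operators return booleans for numeric types;
# ints and floats compare by value.
print(5 > 3)        # True
print(2.5 <= 2.5)   # True
print(3 == 3.0)     # True: equal values, even across int and float
print(7 != 7)       # False
```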


2.1.1 Logical operators can evaluate strings

The rules for using logical operators on strings are slightly different from those for integers. When comparing strings, the following procedure is followed:

  1. Characters are compared by matching indices between strings
  2. When characters are not equivalent, their Unicode value is compared
  3. The character with the lower Unicode value is considered to be smaller
  4. The longer string is considered larger when the character values are equal.

Here's a Unicode table to help us out with our interpretation.

Unicode_Table.jpg

Let's give it a try shall we?
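A quick sketch of those rules in action:

```python
# Strings compare character by character using Unicode code points.
print('apple' < 'banana')   # True: 'a' (97) < 'b' (98)
print('Zebra' < 'apple')    # True: uppercase letters have lower code points
print('apple' < 'apples')   # True: equal prefix, longer string is larger
print(ord('a'), ord('b'))   # 97 98
```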


2.1.2 Values being compared must be of the same type

We've seen that we can compare integers with integers and how strings can be compared, but we can't simply compare dissimilar object types. So no comparing apples to sheep - they just don't stack up.
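A sketch of what actually happens when we try:

```python
# Ordering comparisons between dissimilar types raise a TypeError.
try:
    result = 1 < 'a'
except TypeError as err:
    print('TypeError:', err)

# Equality checks are allowed, but dissimilar types are simply unequal.
print(1 == '1')   # False
```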


2.1.3 Objects must be the same size for comparison

Remember that element-wise comparisons require objects of the same size. Furthermore, logical comparison between list objects can be complicated. Comparison of lists uses lexicographical order, comparing elements at each index beginning with index 0. As elements are compared, they must also follow the previous rules we've outlined.

Read more: If you want to learn more about comparing lists in Python, information can be found here.

If you want to retrieve the results of a proper element-wise comparison you'll have to use something like the Numpy package.
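A small sketch of how built-in list comparison behaves:

```python
# Lists compare lexicographically, element by element from index 0.
print([1, 2, 3] == [1, 2, 3])   # True
print([1, 2, 3] < [1, 3, 0])    # True: decided at index 1, since 2 < 3

# == on whole lists returns a single boolean, not an element-wise result.
print([1, 5, 3] == [1, 2, 3])   # False
```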


2.1.4 For element-wise comparison, use an array

Recall that the numpy.array object has the ability to broadcast operations and perform mathematical expressions on like-sized arrays. The same applies to conditional expression. By converting our list of numbers to an array object, we can perform conditional expressions against a scalar (single value) or against other arrays. The output, of course, is a boolean array of the original size.


2.2.0 Boolean operators

The boolean operators are used for combining True and False values that can come in various formats. We've already come across some examples last lecture when we were filtering our data. Boolean operators can be used to combine logical expressions, variables, or both. We have four operators at our disposal to combine or compare boolean (logical) and non-boolean (bitwise) values.

Operator Description Evaluation rules
and Logical AND results in True only when all comparisons are True True and True = True
True and False = False
False and False = False
& Bitwise AND compares the binary values of an integer at every bit 1010 1010 & 0101 0101 = 0000 0000
or Logical OR results in False only when all comparisons are False True or True = True
True or False = True
False or False = False
| Bitwise OR compares the binary values of an integer at every bit 1010 1010 | 0101 0101 = 1111 1111

Recall that integers can be converted to booleans with any value other than 0 being considered as True. So we can also use bitwise comparison on these booleans although and and or are more appropriate.

Note also that the logical operators (<, >, ==, etc.) take a higher order precedence than and and not when being evaluated within an expression. Conversely bitwise & and | take higher precedence than the logical operators so appropriate use of parentheses () will be required.
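A small sketch of why the parentheses matter:

```python
x = 10

# & binds tighter than the comparisons, so
# x > 1 & x < 10 parses as x > (1 & x) < 10, a chained comparison.
print(x > 1 & x < 10)       # True  (1 & 10 is 0, so this is 10 > 0 < 10)
print((x > 1) & (x < 10))   # False (10 < 10 is False)

# Comparisons, in turn, bind tighter than 'not', 'and', and 'or'.
print(not 1 == 2)           # not (1 == 2) -> True
```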


2.2.1 Use the logical not to negate your boolean values

The final operator we'll review is the logical NOT. This is a unary operator that can evaluate a single input and returns the opposite boolean value to that input. This can be used to negate the boolean evaluation from a logical expression. This can be especially useful when generating conditional statements that will determine which parts of your code are run (ie control of flow).


2.2.2 The logical not can be used on non-boolean objects

Beware: the logical not does not apply across all types of objects.

However, a quick and easy way to determine the status of an object is to use the logical not. It will return False unless the object is empty. This can also be a very useful way to determine if a variable has been assigned a proper (non-empty) object.
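For example (a minimal sketch):

```python
# 'not' treats empty containers and zero as falsy.
print(not [])        # True: empty list
print(not '')        # True: empty string
print(not 0)         # True: zero
print(not [1, 2])    # False: non-empty list
```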


Spiderman_True.png
There are a number of ways to obtain the same boolean result

Here ends the recap on logical operators. Time to loop! But first...

Section 2.0.0 Comprehension Question: What is the result of using a bit-wise AND on the numbers 5 and 3? What about bit-wise OR on 12 and 7? Use the code cell below to help you work out your answers.

3.0.0 Time for for loops

for loops allow you to iteratively perform operations or data manipulation. Their general structure is

for item in iterable:
     statement

In the above general structure, item is a variable that is assigned each element of the iterable in turn, and iterable is any object that can return its elements one at a time.

In plain English, it means something like "for every item in iterable, do statement until you reach the last element in iterable."

The last thing to note is the indentation in this for loop. Up until now, we have not really used tabbed indentation in our coding. Normally, indentation just helps make code more readable, e.g. by visually setting off the statements inside a for loop.

Python takes this philosophy a step further by treating indentation as grammar: statements must be indented to be considered part of the enclosing control flow structure. We'll see what that means in upcoming examples.
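A minimal sketch of the structure and its indentation (the list is made up):

```python
# The indented line belongs to the loop; the unindented line runs after it.
for fruit in ['apple', 'banana', 'cherry']:
    print('inside the loop:', fruit)
print('after the loop')
```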

Here are a some definitions of concepts that we will be using today:

Iteration: the repetition of a sequence of computer instructions a specified number of times or until a condition is met.[1]
Iterator: an object that contains a countable number of values that can be iterated upon, meaning that you can traverse through all the values.[2]
Iterable: an object that has an __iter__ method which returns an iterator.

[1] https://www.merriam-webster.com/dictionary, [2] https://www.w3schools.com/python/python_iterators.asp

3.1.0 Looping over data structures

There are several structures that are iterable, including core Python data structures such as lists, tuples, dictionaries, and sets, and non-core structures such as multidimensional NumPy arrays and Pandas DataFrames. They are iterable containers because you can retrieve iterators from them (https://www.w3schools.com/python/python_iterators.asp).

3.1.1 Iterators

The job of iterators is to create a "count" or "index" of the elements over which you want to iterate, thus creating a road map to loop over an iterable (a data structure). There are several functions that are meant to be iterators or that can also work as iterators, and which one to use varies with the program that you want to write and what iterator you are working with. Here are some of the most common iterators in Python:

3 A callable is an object that can be invoked using round parentheses ( ). 4 A sentinel value is a special value that signals termination; for example, iter(callable, sentinel) stops iterating once the callable returns the sentinel.

That's a lot of functions to digest! Let's step back and break down iteration by looking at the different data structures we know.


3.2.0 Iterating through lists

The built-in list structure represents a mutable structure that you will likely work with often to iterate through. Let's iterate through some examples.


3.2.1 Use loops to increment a counter

Sometimes you might want to count the number of iterations that occur within a loop. This may be part of some branching code that we'll look at later. With a simple version of such code you can answer, for example, "How many odd integers are in my list?"

Let's see how a loop can be used to increment a variable's value.
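A sketch of such a counter (the list here is made up for illustration):

```python
numbers = [3, 8, 5, 12, 7, 2]   # hypothetical data

count = 0
for item in numbers:
    count = count + 1   # increment by 1 on every pass

print('iterations performed:', count)   # 6
```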


Replace 1 by 30, and item by count in the print function, and run the code again. What do you think this code is doing?


3.2.2 Sum the elements of a list with a for loop

Cumulative summation can be done through for loops although we also have the sum() function to accomplish that. Do you wonder how the sum() function actually works?
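A minimal sketch of the idea (values are made up):

```python
numbers = [10, 20, 30, 40]   # hypothetical data

total = 0
for value in numbers:
    total = total + value   # running (cumulative) total

print(total)         # 100
print(sum(numbers))  # 100: the built-in does the same job
```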


3.2.3 Subset using the assigned iterable variable in your loop

Now, lets try to subset using the iterable variable in our for loop. Remember that values will be assigned to an item variable from our iterable with each passing loop.


Python is not happy about it... It says that list indices must be integers or slices, so let's try with integers to see how it behaves. Recall what we know about lists: can we subset or slice a list using string values?


3.2.4 Iterators help to traverse your iterable

Look at what happened above. We ended up accessing an element outside the range of the indices in our list. Even though we had a sequential set of integers, we started at 1, and lists are zero-indexed. Simply using the values from the list isn't the correct way to iterate through it either. What we really want is a list of integer values starting at 0 and going to the length of our list.

Here is where iterators play an important role. Before jumping into iterators, let's see what happens when we pass photosynthesis_type to the len() function.

That didn't work either.


3.2.5 Use the range() function as a for loop interator

Okay, we've looked at a lot of ways how not to make a for loop. Remember, the for loop by itself does not know what to do with a single integer (the output of len()). Instead, let's use the range() function. Recall that the default behaviour, when a single input is provided, is to produce the half-open range [0, stop), where 0 is included and stop is excluded.

The range() function, of course, returns an iterable. Let's give it a try!


The above TypeError means that range() has no idea what to do with a list; it needs integers. What if we combine range() and len()?


Now it works. The loop now has an index to iterate with ("from 0 to the last item").
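Putting those pieces together (photosynthesis_type here is a stand-in for the lecture's list):

```python
photosynthesis_type = ['C3', 'C4', 'CAM']   # stand-in for the lecture's list

# range(len(...)) yields the indices 0, 1, ..., len-1,
# which we can use to subset the list at each pass.
for i in range(len(photosynthesis_type)):
    print(i, photosynthesis_type[i])
```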

So to summarize, we've used a for loop to iterate directly over a list's values and, by combining range() and len(), to iterate over its indices.

What about other data structures?


3.3.0 Looping over one-dimensional Numpy arrays

Recall that NumPy arrays are not built-in data structures. While they share a lot of visual and conceptual similarities with Python list objects, they are not the same. Instead, NumPy includes functions for generating iterators from these objects. As a note, any package that produces iterable objects should include basic methods like __iter__ that Python expects to find when the object is handed to something like a for loop.

A numpy.array object returns an iterator that behaves very much like a list's, so each individual element is returned in turn. That being said, there can be differences in behaviour between objects, and these factors can influence how you should write your code.


3.3.1 Arrays and loops can be combined for broadcasting

Remember that arrays are data structures designed for broadcasting. That means we can do things like multiply across elements, or replace or fill multiple elements at once (in the case of DataFrames).

Challenge

Create a for loop that multiplies the first four digits of array_1 by 3. Store each iteration in an object called iteration.


3.3.2 Obtain multiple iterators with the nditer() function

So far we've only been generating a single iterator using the base behaviour of the for loop. However, we can use functions that return multiple iterators to us after assigning them to variables. This idea also occurs when converting dictionary.items() to an iterator (see Section 6: Appendix 1).

For the numpy package, we have a way to produce multiple iterables with the nditer() function. It can take in one or more array objects and return iterables for each - in the form of a tuple OR as separate iterables.
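A sketch of iterating two arrays in lockstep (the arrays are made up):

```python
import numpy as np

array_a = np.array([1, 2, 3])
array_b = np.array([10, 20, 30])

# nditer() walks both arrays together, yielding one element
# from each array at every step.
for a, b in np.nditer([array_a, array_b]):
    print(int(a) + int(b))   # 11, then 22, then 33
```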


3.3.3 Use nditer() to help broadcast between dissimilar array sizes

Another feature of nditer() is how it handles the production of iterators for multidimensional arrays, and the idea of broadcasting. Suppose that, instead of two 1x6 arrays, one of our arrays was two-dimensional? With arrays and the right code, we can broadcast across rows. Just be sure the sizes match properly or you'll receive an error.

Iterating over 2D arrays: While it seems simple enough to iterate over 1D arrays, things can become more complex when working with 2D or higher order arrays. Much of it involves knowing how the arrays were originally loaded into memory. To learn more about how this works AND how to access elements in memory versus row order, check out Section 6: Appendix 1.

3.4.0 Looping over Pandas DataFrames

So we've spent the last three lectures touching on or working explicitly with DataFrame objects. How does looping over these compare to lists, or even arrays? Recall these are 2D structures of tabulated data, suggesting there is organization of some sort across rows and columns.

Let's import subset_taxa_metadata_merged.csv as data. As we start looping over the file, we'll also quickly recap on importing and subsetting data frames.


3.4.1 Use for loops to import large datasets in smaller chunks

Assuming that you only need to access parts of a file at a time to gather summary information, you can break down large files that will not fit in memory by importing them in smaller chunks. This saves memory and potentially time, as you don't have to wait for the whole file to load. And if information in the file is treated independently between lines or sections - as in large sequencing files - you can work with the data in smaller bites.

Luckily for us, the read_csv() function has a parameter chunksize that we can use to set how many lines we'd like in each chunk. By activating this parameter, the function read_csv() automatically returns an iterable object called a TextFileReader.

Here we will use the concat() function to grow a DataFrame by rows.
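A sketch of the pattern, reading from an in-memory buffer so it is self-contained (the real lesson reads subset_taxa_metadata_merged.csv from disk):

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk.
buffer = io.StringIO(
    "GENUS,count\nStreptococcus,12\nLactobacillus,3\nNeisseria,7\nRothia,5\n"
)

# chunksize makes read_csv return an iterable of DataFrame chunks.
pieces = []
for chunk in pd.read_csv(buffer, chunksize=2):
    pieces.append(chunk)   # each chunk is a 2-row DataFrame

data = pd.concat(pieces, ignore_index=True)
print(data.shape)   # (4, 2)
```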


3.4.2 Reorganize the columns with pop() and insert()

We'll quickly take our current DataFrame and move the count column over to the second position for easier visibility. We can use the pop() method to remove and retrieve the column, and the insert() method to place it back to where we want it. This will alter the DataFrame object in-place.

The pop() method takes the form of pop(item) where item is the name of the column to be removed.

The insert() method takes the form of insert(loc, column, value), where loc is the integer position at which to insert the new column, column is the name of the new column, and value is the data (e.g. a Series) to insert.

Let's move that column now shall we?
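A sketch of that move on a toy DataFrame (the column names and values are stand-ins):

```python
import pandas as pd

df = pd.DataFrame({'GENUS': ['Strep', 'Lacto'],
                   'BODY_SITE': ['Saliva', 'Stool'],
                   'count': [12, 3]})

col = df.pop('count')        # remove and return the 'count' column
df.insert(1, 'count', col)   # put it back in the second position (loc=1)

print(list(df.columns))      # ['GENUS', 'count', 'BODY_SITE']
```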


3.4.3 Subsetting DataFrames recap

Recall there are a number of ways to subset a DataFrame object. We'll focus mainly on the multi-indexing methods, which include the label-based loc[] and the integer-position-based iloc[].

3.4.3.1 Additional rules for subsetting DataFrames

  1. When using a list [ ] within the loc[] or iloc[] methods for a DataFrame object, the resulting object returned will also be a DataFrame. Otherwise, it will be a Series object.
  2. Slicing notation with : can be used with loc[] and iloc[] methods
  3. Both loc[] and iloc[] can accept a boolean Series to subset rows from the DataFrame as long as the dimensions correctly match.
  4. Corollary to (3), you can use a conditional expression to filter/subset your data.
  5. Corollary to (4), you can combine conditional expressions to filter/subset your data.
    • Use logical operators & (logical AND) and | (logical OR) to combine element-wise.

For a brief recap of examples, see Section 7.0.0: Appendix 2.


Let's take our current dataset, data, and select only data where the GENUS is Streptococcus with a count > 0.


What if we use and on this data frame and see what happens?


3.4.4 Use the logical_*() functions as element-wise boolean operators

In the above example we were simply trying to combine the boolean outputs of what would normally be two pandas Series objects. However, Python could not determine the truth value of these objects. Instead, we need to turn to the functions logical_and(), logical_or(), and logical_not() to accomplish our task. They have the same behaviour as their Python counterparts but are able to properly handle the multi-dimensional data of these objects.

These functions are also distinguished from the bitwise operators & (AND), | (OR), and ~ (NOT) mainly by their order of precedence. Remember that the bitwise operators take evaluation precedence over the logical operators (<, >, ==, etc.).

The logical_*() functions, however, are given a list of boolean statements over which they will element-wise evaluate the expression and return a result. These functions are part of the Numpy package and were designed to work specifically with ndarray objects. Recall that the Series class inherits its behaviours from ndarray.

Let's see some follow-up examples.
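A sketch with a toy Series (the values are made up):

```python
import numpy as np
import pandas as pd

counts = pd.Series([2, 50, 400])

# Element-wise AND of two boolean Series, via NumPy.
mask = np.logical_and(counts > 5, counts < 300)
print(list(mask))   # [False, True, False]
```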

We know there are Streptococcus OTUs with fewer than 5 and with more than 300 counts. Why are they not showing up in the output?


Select Streptococcus with fewer than 10 counts that do not come from Saliva samples

That is all for the recap on conditional and logical operators. Back to flow control.


3.4.5 The default for loop iterator for DataFrames returns column names

How do we print the first 10 observations from the GENUS column? We already have a number of routes to arrive at this solution but can we accomplish this using a for loop? Let's try the intuitive thing and just provide the DataFrame to the for loop first.


3.4.6 Provide a single DataFrame column to iterate in a for loop

No errors but this is not what we wanted.

Can you identify what is missing in the code? Our call unpacked all of the column names in data - not just the first 10. Python has no idea what we are asking for, so it defaults to iterating over the column names of a data frame.

So we definitely didn't provide Python with the code needed to interpret our intent. Would we be better off just providing a single column? Let's try.


3.4.6.1 Combine your code with the notna() method to further filter your for loop iterator values

As you can see, providing a single column generates an iterator over the elements of the Series. That works, but it doesn't get us the first 10 valid entries from the GENUS column. Instead, we should filter out missing or NaN values with the notna() method.

To avoid NaN in the output, apply notna() when subsetting the data. At this point our code is getting a little long when we perform the subsetting within the for loop. To reduce confusion, you could create a variable before adding it to the for loop.

In addition to the subsetting, sort the data alphabetically in increasing order (from A to Z) after selecting the first 10 values. Let's do this in a couple of steps.


The for loop below prints every genus that is not NaN. Now make the loop look better and easier to debug.


3.4.7 Iterators for rows of a DataFrame

We've now seen that looping through a DataFrame by column can be straightforward. Often, however, we encounter data sets where we want to collate data from multiple columns. In that case, we would want to iterate through the DataFrame by rows. Be forewarned: this can be both memory intensive and slow once your DataFrame is sufficiently large.

There are two methods we can use to properly create iterators over the rows of a DataFrame:

  1. The iterrows() method will return each row as a Series object, but this can be problematic as the data types must be converted into a single type. This can produce unpredictable or undesired results.

  2. The itertuples() method is much faster and preferable to using iterrows().

We'll focus on using itertuples() to see exactly how we can use it to iterate over our rows. With each pass we'll sum two variables from the DataFrame: "count" and "VISITNO".


3.4.8 Simply use a range() to iterate through a DataFrame

Sometimes the simplest way to work through your DataFrame object, row-by-row, is by using their index position in combination with a range(). If you don't know what range you want to use, you can retrieve the dimensions of our DataFrame using the shape attribute.

While not quite as clean as using an iterator, it's certainly an approach that will work.


3.4.9 Yes you can use a for loop, but should you?

We've been having fun generating code that lets us iterate through our objects, but do remember that there are built-in functions for calculating the simpler things in our data structures. While we sometimes need the practice, it is often just a matter of efficiency - especially with large data sets.

Let's use a for loop to calculate some summary statistics on our filtered subset - just for practice.

Section 3.4.9 Comprehension Question: Using the code cell below, generate a for loop that can calculate the total from the mean of the VISITNO column of our data_fil dataset.

3.5.0 Use list comprehension to quickly build/manipulate subsets of data

for loops do not always have to be at the beginning of your code section. Instead we can build a for loop directly into a calculation if we want to use it to build a quick iterable for us to evaluate. This can take the form of

newlist = [expression for item in iterable if condition], where expression is the operation applied to each item, item is the loop variable, iterable is the data source, and the optional condition filters which items are included.

In the following example, we will calculate the standard deviation of bacterial counts by taking the square root of the variance.

$$\sigma = \sqrt{\frac{\sum(x_i - \bar{x})^2}{N}}$$

We'll build up slowly to get a sense of what's happening.


Now let's do something with our generator by using the pow(value, exponent) function to get the square of the difference between the value and the mean.

We now have the top half of our equation - which really takes care of the list comprehension part for us. We didn't need to filter the data but we could have included a condition like sum(pow(value - mean, 2) for value in data_fil['count'] if value > 0) which would alter our sum total (try it for yourself!)

Now we just need to divide by N and calculate the square root.

Use Numpy's np.std() function to corroborate your result
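Putting the whole calculation together on a made-up list of counts:

```python
import numpy as np

counts = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical counts

mean = sum(counts) / len(counts)

# Generator expression: the top half of the formula.
squared_diffs = sum(pow(value - mean, 2) for value in counts)

std = (squared_diffs / len(counts)) ** 0.5
print(std)              # 2.0
print(np.std(counts))   # 2.0: NumPy agrees
```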


3.5.1 Create a DataFrame of counts per microbe

Let's take a closer look at our filtered dataset data_fil to generate total counts based on the unique GENUS values. We'll use a combination of filtering and method chaining. At the end we'll combine our values with the zip() function, which can pair up tuples as columns to help us make a DataFrame.

The zip() function will take in multiple iterators and match them together by index to create a series of tuples. Then an iterator of tuples will be returned.
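A sketch of how zip() pairs two result lists into rows (the values are made up):

```python
import pandas as pd

genera = ['Streptococcus', 'Lactobacillus']   # hypothetical results
totals = [120, 45]

rows = list(zip(genera, totals))   # [('Streptococcus', 120), ('Lactobacillus', 45)]
df = pd.DataFrame(rows, columns=['GENUS', 'total'])
print(df.shape)   # (2, 2)
```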

Do you recall another way to do this from last lecture?


3.5.2 Use List comprehension to make the same DataFrame

So in the above, we generated a couple of extra variables, stored the results, and used the zip() function to combine them before making a DataFrame. Now we'll use list comprehension to do the same thing in a "single" line. Here we've broken it into a few lines for readability. Before we try to move ahead, let's break down the components:

  1. A for loop where we produce an iterator of unique genus names.
  2. A filtering step where we subset by genus to sum the count per genus.
  3. A tuple of (genus, sum) where we match a genus with the sum of its count variable.
  4. We'll sort the values by the second column and then look at the first 5 entries.
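Those components might be sketched like this on a toy data_fil (the column names and values here are stand-ins):

```python
import pandas as pd

# Stand-in for the lecture's filtered dataset.
data_fil = pd.DataFrame({'GENUS': ['Strep', 'Lacto', 'Strep'],
                         'count': [10, 3, 5]})

# One list comprehension: iterate unique genera, filter by genus,
# sum the counts, and pair each genus with its total...
totals = [(genus, int(data_fil.loc[data_fil['GENUS'] == genus, 'count'].sum()))
          for genus in data_fil['GENUS'].unique()]

# ...then sort by the summed counts and keep the top 5 entries.
top5 = sorted(totals, key=lambda pair: pair[1], reverse=True)[:5]
print(top5)   # [('Strep', 15), ('Lacto', 3)]
```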

3.5.3 Avoid complex code altogether using built-in methods

So in both of our previous examples we use the for loop in some way to iterate through a list to filter data before summarizing it. Of course we've seen there is an easier way to do this with the proper use of the groupby() method. Let's recount how that can work.


3.6.0 Nested for loops are loops inside loops inside loops...

We've already discussed the concept of nested objects: lists, arrays, dictionaries. A nested for loop is a similar idea: having loops running as statements within your loops. There's no real limit to how many for loops you can nest but if you're deeply nesting for loops, there may be better ways to accomplish your goal.

Now that we have the non-missing data in the form of data_fil, let's create a similar cumulative sum of counts, except on a per-genus, per-body-site basis. We'll change it up and build our results with a dictionary this time. It's really quite similar to what we had before, except all of the results are stored in a single variable.

We can use the sleep() function from the time module to show in real time what the loop does. We'll just do it on a subset of our data though.


3.6.1 Beware the allure of the nested for loop

Based on the information we have, it appears that we produced exactly what we wanted: 221 unique genera and 5 unique body sites yielding 1105 total combinations. Not so fast though - do all of these combinations truly exist in our dataset? In fact there are only 533 combinations between these two sets. We can prove this by turning again to the groupby() method.

By using the nested for loop we produced combinations that don't exist within our dataset. The problem goes unnoticed because sum() of an empty object returns 0, so we end up filling in a value for every combination in our DataFrame whether or not it actually exists.


Work smarter, not harder: Existing functions should be your go-to option, either built-in or from a library. Those functions are optimized to get the job done very efficiently in terms of time and computational resources. Why reinvent the wheel?
DoctorWho_NestedLoops_small.jpg
Although for loops are helpful, you may consider another direction if you're nesting too deeply.
https://devrant.com/rants/2230569/ive-seen-people-do-more-than-4-as-well-though

4.0.0 Conditional control flow

We use the term conditionals to denote logical expressions that specifically evaluate to True or False and are used to determine how a program will run. Which set of code will it run next? Will it terminate a loop? This is where we also get the idea of flow control or control of flow.

4.1.0 The if statement executes when the conditional evaluates to True

The purpose of the if control statement is pretty clear. if a condition is met (True), then execute a statement. The following is the general structure of if:

if condition:
    statement

where condition can be a simple logical expression or a complex one involving many of the operators we've already covered. Let's give it a try.
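A minimal example (the count value and the 150 threshold are made up for illustration):

```python
count = 200
label = None

# The indented statement only runs when the condition evaluates to True
if count > 150:
    label = "abundant"

print(label)
```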


4.1.1 The else statement executes when your if conditional evaluates to False

In the above code we have not set any instruction for when the condition evaluates to False. In that case, the statement line is not evaluated and therefore nothing is printed.

Think of the else statement much like plan B. It allows us to provide a catch-all set of code to run in the case where our conditional has "failed". Let's update our general code structure:

if condition:
    first_statement
else:
    second_statement

Simple, right?

The following code adds a boolean column called abundant to a subset of data called subdata (just for computational efficiency). Every observation (row) where the microbial count is greater than 150 will be classified as "yes" for abundant and "no" otherwise.

We'll also revisit or introduce two concepts:

  1. The DataFrame method itertuples() which returns named tuples of values from our DataFrame. It allows us to iterate over rows as named tuples.
  2. The Unpacking/Packing operator *.

4.1.1.1 The unpacking operator *

Up until now we've used this operator for multiplication, but when placed directly to the left of an iterable, it unpacks the elements, either to pass them as arguments to a function or to pass along the elements of an iterator. Conversely, we can use it as part of a variable assignment to pack (or repack) an unspecified number of elements into a list as a single variable.

Read more: You can find out more about the packing and unpacking operator with this great tutorial

Let's practice with unpacking and packing before moving forward shall we?
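A short warm-up showing both directions:

```python
values = [3, 1, 2]

# Unpacking: * spreads the list's elements out as separate arguments
print(*values)        # equivalent to print(3, 1, 2)

# Packing: * gathers "everything else" into a list during assignment
first, *rest = values
print(first, rest)
```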


4.1.1.2 Back to using the else statement

Okay let's put that else statement to use now that we can understand the following code.
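A sketch of the idea on hypothetical data; the column name microbe_count stands in for whatever the real count column is called (avoiding "count", which shadows a tuple method name):

```python
import pandas as pd

# Hypothetical stand-in for subdata
subdata = pd.DataFrame({
    "genus": ["Streptococcus", "Prevotella", "Lactobacillus"],
    "microbe_count": [200, 90, 310],
})

# Build the new column row by row with itertuples()
abundant = []
for row in subdata.itertuples():
    if row.microbe_count > 150:
        abundant.append("yes")
    else:
        abundant.append("no")

subdata["abundant"] = abundant
print(subdata)
```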


4.1.2 Use the elif statement if you have multiple conditions to check

Getting the hang of if and else? Next in line is elif. Consider elif an intermediate between if and else. It's literally a portmanteau of else and if which means if you want to check for multiple possible scenarios - usually (but not necessarily) with an order of precedence, then you can use the elif statement to go through that checklist. Let's see how it works.

Based on the microbial count, add a column called treatment that will be either treatment_A, treatment_B, or No action depending on the microbe counts (for the sake of this exercise, let's assume that all microbes have pathogenic potential on humans).

Remember: if a conditional fails, the statement within it will not be executed!
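A sketch of the decision logic as a standalone function; the thresholds 300 and 150 are invented for illustration, not the ones used in class:

```python
def assign_treatment(count):
    # Conditions are checked top to bottom; the first True branch wins
    if count > 300:          # hypothetical threshold
        return "treatment_A"
    elif count > 150:        # hypothetical threshold
        return "treatment_B"
    else:
        return "No action"

for c in (500, 200, 50):
    print(c, assign_treatment(c))
```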


4.2.0 Use while loops when you want to iterate based on a condition

while loops run "while" a condition continues to evaluate to True. At the start of each loop, the condition is re-evaluated before a decision is made. If a for loop and an if statement were to make a weird code-baby, the while loop would be it.

You can use the conditional to iterate in different ways like:

  1. Walking or moving through an iterable
  2. Counting through a specific number of "successful" operations

Let's experiment with how that works shall we?
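Both styles in miniature:

```python
# Example 1: walk through an iterable by emptying it
queue = ["a", "b", "c"]
while queue:                 # an empty list evaluates to False
    item = queue.pop(0)
    print(item)

# Example 2: count through a fixed number of iterations
z = 0
total = 0
while z < 5:
    total += z
    z += 1                   # don't forget to increment!
print(total)
```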


4.2.1 Make sure your while conditional can evaluate to False

In our first example, there is an eventual end to the list because we are permanently removing items. Therefore the conditional will evaluate to False when the list is empty (ie []).

Our second example, however, requires us to remember to increment our variable z. Since this is quite a simple loop it's not an issue as we always increment the value of z. In other cases with complex branching code with if and/or elif statements you must be careful to check that your conditional will eventually fail.

Let's try another example where we print only those rows from subdata where the genus is either 'Streptococcus' or 'Lactobacillus'.

Now we have a list of two DataFrames: "strept" and "lactobac". We can use a for loop to unlist them, then use Pandas' concat() to join them into a single DataFrame.
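A sketch of the whole idea, using a while loop to empty a list of target genera on hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-in for subdata
subdata = pd.DataFrame({
    "genus": ["Streptococcus", "Prevotella", "Lactobacillus", "Streptococcus"],
    "count": [10, 4, 6, 2],
})

targets = ["Streptococcus", "Lactobacillus"]
pieces = []
while targets:                       # loop until the target list is empty
    genus = targets.pop(0)
    pieces.append(subdata[subdata["genus"] == genus])

# Join the collected DataFrames back into one
combined = pd.concat(pieces)
print(combined)
```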


4.3.0 Advance through an iterator without looping using the next() function

Recall from the lecture 03 appendix that for each iterator, we can use the next() function to retrieve the next item in the queue; the iterator remembers its place in the queue. This continues until the last element is returned, after which the iterator is empty.

If you try to go past the last element, Python will provide a StopIteration error to let you know you've gone too far.

Let's practice with the next() function.
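A quick demonstration, including the optional default value that avoids the StopIteration error:

```python
it = iter(["gut", "mouth", "skin"])

print(next(it))          # "gut"
print(next(it))          # "mouth"
print(next(it))          # "skin"

# next(it) here would raise StopIteration; supplying a
# default value returns that instead of raising the error
print(next(it, "empty"))
```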


break and continue can interrupt loops

Sometimes you may be looping through with a for or while loop when an unexpected condition occurs. Perhaps you wanted to error-proof your code or need to exit a loop based on internal conditions encountered while examining your data. Sometimes you may have a last-ditch conditional to prevent yourself from iterating too many times, or even endlessly.

When you need to explicitly exit a loop, you can use the break command. This will end the loop without further repetition.

Alternatively, you may have a long series of code that you don't want to even bother evaluating with more conditionals (to save on processing power for instance). You can end the current iteration of a loop and begin the next using the continue command.

Let's work through a few examples
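A small example showing both commands in one loop:

```python
# continue skips the rest of the current iteration;
# break exits the loop entirely
kept = []
for n in range(10):
    if n % 2 == 0:
        continue          # skip even numbers
    if n > 7:
        break             # stop the whole loop once n exceeds 7
    kept.append(n)

print(kept)               # the odd numbers up to 7
```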

Section 4.0.0 Comprehension Question: Look at the following code cell. Without running the code, explain what you think will happen when this code runs. How many times will it loop before stopping? What will the output look like? What do you think might be wrong with this code segment?
counter = 10
while counter > 0:
    print("still going")
    if counter < 2:
        counter = counter + 1
        continue
    counter = counter - 1
print("done")

Answer:

if_elif_while_everywhere.png
That's right, you can nest control statements inside control statements of any kind really...

5.0.0 Class summary

That's our fifth class on Python! You've made it through and we've learned about a number of logical expression operators and how to apply them in loops and filtering data:

  1. Flow control
  2. Logical, conditional, and comparison operators
  3. For loops
  4. Conditional control flow

5.1.0 Submit your completed skeleton notebook (2% of final grade)

At the end of this lecture a Quercus assignment portal will be available to submit your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade but a bonus 0.7% will also be awarded for submissions made within 24 hours from the end of lecture (ie 1700 hours the following day).

5.2.0 Post-lecture DataCamp assessment (8% of final grade)

Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 3-5 (Logic, Control Flow and Filter, 1500 possible points; Loops, 1450 possible points; and Case Study, 1200 possible points) from the Intermediate Python course. This is a pass-fail assignment, and in order to pass you need to achieve at least 3112 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.

In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this:

DataCamp.example.png
A sample screenshot for one of the DataCamp assignments. You'll want to combine yours into a single image or PDF if possible.

Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.

You will have until 13:59 hours on Thursday, February 17th to submit your assignment (right before the next lecture).


5.3.0 Acknowledgements

Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.2.0: edited and prepared for CSB1021H S LEC0140, 01-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


5.4.0 Resources


6.0.0 Appendix 1: More on looping

6.1.0 Iterating through dictionaries

Recall that dictionaries consist of key:value pairs and that unlike lists, they have no positional index. Instead, values are accessed by providing a matching key. When we provide a dictionary object to a for loop, it will return an iterator over its keys.

Let's revisit our amino acid dictionary from lecture 2.
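A minimal sketch; the three entries below are a hypothetical slice of that dictionary:

```python
# Hypothetical slice of the lecture 2 amino acid dictionary
amino_acids = {"A": "Alanine", "C": "Cysteine", "D": "Aspartate"}

# Looping over a dictionary yields its keys
seen = []
for key in amino_acids:
    print(key)
    seen.append(key)
```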


6.1.1 Iterate through your dictionary values by indexing with the hash

Now that we know we can get the key information in our for loop, we can use that much like we did with our list examples to iterate through the value information stored in the dictionary.
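For example, still using the hypothetical slice of the amino acid dictionary:

```python
amino_acids = {"A": "Alanine", "C": "Cysteine"}

names = []
for key in amino_acids:
    names.append(amino_acids[key])   # use each key to index its value
print(names)
```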


6.1.2 Use the attributes of a dictionary as an iterator

Like the dictionary, you can also iterate through its attributes like the keys and values. Remember that we can use methods from the dictionary object to return this information for us. There are three methods we can use for this purpose: keys(), values(), and items().
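All three methods side by side, on the hypothetical slice of the amino acid dictionary:

```python
amino_acids = {"A": "Alanine", "C": "Cysteine"}

print(list(amino_acids.keys()))     # the keys
print(list(amino_acids.values()))   # the values
print(list(amino_acids.items()))    # each key:value pair as a tuple
```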


Or get the whole dictionary


Each key:value pair is printed as a tuple.


6.1.3 Use the for loop to assign multiple variables from your iterator

Knowing that the items() method returns tuple objects from our dictionary - specifically with two elements each - can we take advantage of that information? Rather than indexing the information from the tuple, let's try to assign multiple variables to the elements of our tuple within the for loop itself.

As you can see, trying to assign beyond the number of values available will result in an error.
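In sketch form, on the hypothetical slice of the amino acid dictionary:

```python
amino_acids = {"A": "Alanine", "C": "Cysteine"}

# items() yields (key, value) tuples, so two loop variables unpack cleanly
for code, name in amino_acids.items():
    print(code, "->", name)

# Three loop variables, e.g. `for code, name, extra in amino_acids.items():`,
# would raise ValueError because each tuple holds only two elements
```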


6.2.0 Iterating through Numpy 2D-arrays using nditer()

From our previous examples with arrays, iterating through a 1D array seems pretty straightforward. Iteration over 2D Numpy arrays, however, is slightly more complex than over their 1D counterparts if you are re-arranging them on the fly.

"An important thing to be aware of for this iteration is that the order is chosen to match the memory layout of the array instead of using a standard C or Fortran ordering. This is done for access efficiency, reflecting the idea that by default one simply wants to visit each element without concern for a particular ordering. We can see this by iterating over the transpose of our previous array, compared to taking a copy of that transpose in C order." https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.nditer.html

What does all that mean? In simpler terms the iterator for an array uses the same order as it is stored in memory regardless of the shape the array may be in. Let's see how that plays out in practice
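A small demonstration of the memory-order behaviour (the 2x3 array here is invented for illustration):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)     # [[0 1 2], [3 4 5]]

# Iterating the transpose still visits elements in the ORIGINAL
# memory order, because transposing only creates a view...
print([int(x) for x in np.nditer(a.T)])

# ...while a C-ordered copy of the transpose has been physically
# re-arranged in memory and is visited row by row
print([int(x) for x in np.nditer(a.T.copy(order="C"))])
```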


6.2.1 Use the order parameter to override how iterator elements are made in nditer()

See how the process of copying the array has re-arranged it's elements in memory as well?

You don't necessarily want to copy your objects every time you want to move through them after transposing or reshaping them. Instead, look to the specific parameters of nditer(), which include the order parameter. It accepts 'C' (row-major, C-style order), 'F' (column-major, Fortran-style order), 'A' ('F' if the array is Fortran contiguous, 'C' otherwise), and 'K' (match the layout in memory, the default).
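Continuing the earlier 2x3 example, the order parameter lets us traverse the transposed view however we like, without making a copy:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)     # [[0 1 2], [3 4 5]]

# order='C': row-major traversal of the transposed (3x2) view
print([int(x) for x in np.nditer(a.T, order="C")])

# order='F': column-major traversal of the same view
print([int(x) for x in np.nditer(a.T, order="F")])
```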

Read more: For more information on this parameter check out the documentation

7.0.0 Appendix 2: Subsetting DataFrames

Below you'll find some example code for subsetting DataFrame objects. Recall some of our rules involving subsetting DataFrame objects include:

  1. Using a list to subset one or more columns returns a DataFrame.
  2. The loc[] and iloc[] methods can be used to subset both a row and column range, both of which are also amenable to slicing notation.
  3. The loc[] and iloc[] methods accept boolean Series or conditional expressions or a mixture of both. All Series and expressions must have the same number of elements as there are rows in the DataFrame.
  4. DataFrames can be subset through a series of chain indexing ie [row][col][conditional_expression] but you will be returned a copy of the data and not access to the original.
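The first three rules above, sketched on a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "genus": ["Streptococcus", "Prevotella", "Lactobacillus"],
    "count": [200, 90, 310],
})

# Rule 1: a list of column names returns a DataFrame, not a Series
print(df[["genus"]])

# Rule 2: loc[] slices by label and is end-inclusive on both axes
print(df.loc[0:1, "genus":"count"])

# Rule 3: loc[] accepts a boolean conditional expression with
# one element per row
print(df.loc[df["count"] > 150, ["genus"]])
```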

The next piece of code is not going to work. Can you tell why?


The Centre for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.

CAGEF_new.png